3 Exploratory Data Analysis (EDA)

The image above is from R4DS(2e) by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  1. Generate questions about your data

  2. Search for answers by visualising, transforming, and/or modeling your data

  3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: EDA)

3.1 GDP and GDP per Capita

  1. GDP, PPP (constant 2017 international $): NY.GDP.MKTP.PP.KD

  2. Population, total: SP.POP.TOTL

  3. Calculate GDP per Capita

    • GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD
  • GDP, PPP (constant 2017 international $): PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.MKTP.PP.KD

  • Population, total: Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. ID: SP.POP.TOTL

df_gdppcap <- WDI(indicator = c(gdp = "NY.GDP.MKTP.PP.KD", pop = "SP.POP.TOTL", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_gdppcap, "data/gdppcap.csv")
df_gdppcap <- read_csv("data/gdppcap.csv")
Rows: 16758 Columns: 15
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): country, iso2c, iso3c, region, capital, income, lending
dbl  (6): year, gdp, pop, gdppcap, longitude, latitude
lgl  (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df_gdppcap)
spc_tbl_ [16,758 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ country    : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ iso2c      : chr [1:16758] "AF" "AF" "AF" "AF" ...
 $ iso3c      : chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
 $ year       : num [1:16758] 2014 2012 2009 2013 1971 ...
 $ status     : logi [1:16758] NA NA NA NA NA NA ...
 $ lastupdated: Date[1:16758], format: "2023-09-19" "2023-09-19" "2023-09-19" "2023-09-19" ...
 $ gdp        : num [1:16758] 7.02e+10 6.47e+10 4.99e+10 6.83e+10 NA ...
 $ pop        : num [1:16758] 32716210 30466479 27385307 31541209 11015857 ...
 $ gdppcap    : num [1:16758] 2144 2123 1824 2165 NA ...
 $ region     : chr [1:16758] "South Asia" "South Asia" "South Asia" "South Asia" ...
 $ capital    : chr [1:16758] "Kabul" "Kabul" "Kabul" "Kabul" ...
 $ longitude  : num [1:16758] 69.2 69.2 69.2 69.2 69.2 ...
 $ latitude   : num [1:16758] 34.5 34.5 34.5 34.5 34.5 ...
 $ income     : chr [1:16758] "Low income" "Low income" "Low income" "Low income" ...
 $ lending    : chr [1:16758] "IDA" "IDA" "IDA" "IDA" ...
 - attr(*, "spec")=
  .. cols(
  ..   country = col_character(),
  ..   iso2c = col_character(),
  ..   iso3c = col_character(),
  ..   year = col_double(),
  ..   status = col_logical(),
  ..   lastupdated = col_date(format = ""),
  ..   gdp = col_double(),
  ..   pop = col_double(),
  ..   gdppcap = col_double(),
  ..   region = col_character(),
  ..   capital = col_character(),
  ..   longitude = col_double(),
  ..   latitude = col_double(),
  ..   income = col_character(),
  ..   lending = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
df_gdppcap |> select(region, income, lending) |> lapply(unique)
$region
[1] "South Asia"                 "Aggregates"                 "Europe & Central Asia"      "Middle East & North Africa"
[5] "East Asia & Pacific"        "Sub-Saharan Africa"         "Latin America & Caribbean"  "North America"             
[9] NA                          

$income
[1] "Low income"          "Aggregates"          "Upper middle income" "Lower middle income" "High income"        
[6] NA                    "Not classified"     

$lending
[1] "IDA"            "Aggregates"     "IBRD"           "Not classified" "Blend"          NA              
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
  ggplot(aes(year, gdppcap)) + geom_line()

COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
  ggplot(aes(year, pop)) + geom_line()

3.1.0.1 Exercise.

Write your observations and questions.

3.1.1 GDP Per Capita

df_gdppcap2 <- df_gdppcap |> drop_na(pop) |> 
  mutate(PCAP = gdp/pop, .after = gdppcap)
df_gdppcap2

3.1.1.1 Check against GDP per capita, PPP

df_gdppcap2 |> drop_na(gdppcap, PCAP) |> mutate(near = near(gdppcap, PCAP)) |> 
  summarize(numberofdata = n(), sum(near))
df_gdppcap2 |> filter(!near(gdppcap, PCAP))

3.1.1.2 Exercise.

Write your observations and questions.

3.1.2 Visualization

Two useful questions.

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

See Posit Primers: EDA (https://posit.cloud/learn/primers/3.1).

3.1.2.1 Ranks.

arrange(desc(gdp)) sorts the rows in descending order of gdp; arrange(gdp) sorts them in ascending order.

df_gdppcap |> filter(year == 2022, region != "Aggregates") |> 
  drop_na(gdp) |> arrange(desc(gdp))

3.1.2.2 Exercises.

  1. Find the 10 countries with the highest GDP per capita.

  2. Find the 10 countries with the lowest GDP per capita.

  3. Find the 10 countries with the largest population.

  4. Find the 10 countries with the smallest population.

3.1.2.3 Scatter Plot

What type of covariation occurs between my variables?

df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp)) + geom_point()

df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp)) + geom_point() + 
  scale_x_log10() + scale_y_log10()

df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() + scale_y_log10()

df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp, color = region)) + geom_point() + 
  scale_x_log10() + scale_y_log10()

df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp, color = region, shape = income)) + geom_point() + 
  scale_x_log10() + scale_y_log10()

df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> 
  drop_na(gdp, gdppcap, pop) |> 
  ggplot(aes(gdppcap, gdp, color = region, size = pop)) + geom_point() + 
  scale_x_log10() + scale_y_log10()

install.packages("plotly")
library(plotly)
test <- df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |> 
  ggplot(aes(color = country, shape = region, pop, gdp)) + geom_point() + 
  scale_x_log10() + scale_y_log10() + theme(legend.position = "none")
test |> ggplotly()
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to
discriminate; you have 7. Consider specifying shapes manually if you must have them.

3.1.2.4 Variation 1. Histogram

df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |> 
  ggplot(aes(gdp)) + geom_histogram()

df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap)) + geom_histogram()

df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |> 
  ggplot(aes(gdp)) + geom_histogram() + scale_x_log10()

3.1.2.5 Exercises.

  1. Change the number of bins, e.g., geom_histogram(bins = 20).

df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |> 
  ggplot(aes(gdp)) + geom_histogram(bins = 20) + scale_x_log10()

  2. Create a similar histogram of gdppcap by using scale_x_log10() and adjust the number of bins.

df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap)) + geom_histogram() + scale_x_log10()

  3. Create a similar histogram for population.

  4. Write your observations and comments.

3.1.2.6 Extra.

df_gdppcap |> filter(year == 2022,region != "Aggregates") |> drop_na(pop) |> 
  group_by(region) |> 
  ggplot(aes(pop, fill = region)) + geom_histogram(col = "black", linewidth = 0.2) + scale_x_log10()

3.1.2.7 Variation 2. Boxplot

df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10()

df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10() +
  labs(title = "Distribution of the GDP per Capita of Countries", subtitle = "Year 1990, 2000, 2010, 2020", 
       y = "Year", x = "GDP per capita in log10 scale")

df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |> 
  filter(income != "Aggregates") |> 
  ggplot(aes(gdppcap, income, fill = income)) + geom_boxplot() + scale_x_log10() +
  theme(legend.position = "none")

df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |> 
  filter(income != "Aggregates") |> 
  ggplot(aes(gdppcap, factor(income, levels = c("High income", "Upper middle income", "Lower middle income", "Low income")), fill = income)) + geom_boxplot() + scale_x_log10() +
  labs(y = "") +
  theme(legend.position = "none")

df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |> 
  filter(income != "Aggregates") |> 
  ggplot(aes(gdp, region, fill = region)) + geom_boxplot() + scale_x_log10() +
  theme(legend.position = "none")

3.2 CO2 Emissions Per Capita vs GDP Per Capita

  1. CO2 emissions (metric tons per capita): EN.ATM.CO2E.PC

  2. GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD

  • CO2 emissions (metric tons per capita): Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring. ID: EN.ATM.CO2E.PC

  • GDP per capita, PPP (constant 2017 international $): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD

3.2.1 Importing Data

df_co2gdp <- WDI(indicator = c(co2pcap = "EN.ATM.CO2E.PC", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_co2gdp, "data/co2gdp.csv")
df_co2gdp <- read_csv("data/co2gdp.csv")
Rows: 16758 Columns: 14
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): country, iso2c, iso3c, region, capital, income, lending
dbl  (5): year, co2pcap, gdppcap, longitude, latitude
lgl  (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3.2.2 Visualization by Line Graphs

COUNTRY <- "World"
df_co2gdp |> filter(country == COUNTRY) |>
  ggplot(aes(year, co2pcap)) + geom_line()

ISO2C <- c("JP", "CN", "ID", "GB", "US", "DE", "FR")  # the iso2c code for the United Kingdom is "GB", not "UK"
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
  ggplot(aes(year, co2pcap, linetype = iso2c)) + geom_line()

3.2.2.1 Exercises.

  1. Change the iso2c codes to those of the countries you want to investigate. Use df_codes under Environment to look them up.
  2. Change linetype to col.

3.2.3 Scatterplot for Covariation

df_co2gdp |> filter(year == 2020) |> drop_na(co2pcap) |>
  ggplot(aes(gdppcap, co2pcap)) + geom_point()

df_co2gdp |> filter(year == 2020) |> 
  drop_na(gdppcap, co2pcap) |>
  ggplot(aes(gdppcap, co2pcap)) + geom_point() +
  scale_x_log10() + scale_y_log10()

3.2.3.1 Scatterplot with a regression line

df_co2gdp |> filter(year == 2020) |> 
  drop_na(gdppcap, co2pcap) |>
  ggplot(aes(gdppcap, co2pcap)) + geom_point() +
  geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
  scale_x_log10() + scale_y_log10()

3.2.3.2 Summary of a linear model

df_co2gdp |> filter(year == 2020) |> drop_na(gdppcap, co2pcap) |>
  lm(log10(co2pcap)~log10(gdppcap), data = _) |> summary()

Call:
lm(formula = log10(co2pcap) ~ log10(gdppcap), data = drop_na(filter(df_co2gdp, 
    year == 2020), gdppcap, co2pcap))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.60778 -0.15660 -0.00651  0.16129  0.59437 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -4.31545    0.13386  -32.24   <2e-16 ***
log10(gdppcap)  1.13831    0.03288   34.62   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2362 on 228 degrees of freedom
Multiple R-squared:  0.8402,    Adjusted R-squared:  0.8395 
F-statistic:  1199 on 1 and 228 DF,  p-value: < 2.2e-16

3.3 School Enrollment vs GDP Per Capita

  1. School enrollment, secondary (% gross): SE.SEC.ENRR

  2. GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD

  • School enrollment, secondary (% gross): Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers. ID: SE.SEC.ENRR

  • GDP per capita, PPP (constant 2017 international $): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD

3.3.1 Importing Data

df_secgdp <- WDI(indicator = c(sec = "SE.SEC.ENRR", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_secgdp, "data/secgdp.csv")
df_secgdp <- read_csv("data/secgdp.csv")
Rows: 16758 Columns: 14
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): country, iso2c, iso3c, region, capital, income, lending
dbl  (5): year, sec, gdppcap, longitude, latitude
lgl  (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3.3.2 Visualization by Line Graphs

COUNTRY <- "World"
df_secgdp |> filter(country == COUNTRY) |>
  ggplot(aes(year, sec)) + geom_line()

COUNTRIES <- c("Low income", "Low & middle income", "Lower middle income", "Middle income", "Upper middle income", "High income")
df_secgdp |> filter(country %in% COUNTRIES) |> drop_na(sec) |>
  ggplot(aes(year, sec, linetype = factor(country, levels = COUNTRIES))) + geom_line() +
  labs(linetype = "Income Levels")

3.3.2.1 Exercise.

Change COUNTRIES to the ISO2C codes of the countries you want to investigate, and filter on iso2c instead of country. Use df_codes under Environment to look up the codes.

df_secgdp |> filter(year == 2020) |> drop_na(sec) |>
  ggplot(aes(gdppcap, sec)) + geom_point()

df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
  ggplot(aes(gdppcap, sec)) + geom_point() +
  scale_x_log10()

df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
  ggplot(aes(gdppcap, sec)) + geom_point() +
  geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
  scale_x_log10()

df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
  lm(sec~log10(gdppcap), data = _) |> summary()

Call:
lm(formula = sec ~ log10(gdppcap), data = drop_na(filter(df_secgdp, 
    year == 2020), gdppcap, sec))

Residuals:
    Min      1Q  Median      3Q     Max 
-53.777 -10.846  -1.173   9.006  66.996 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -102.994     11.933  -8.631 6.38e-15 ***
log10(gdppcap)   46.088      2.841  16.222  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.64 on 157 degrees of freedom
Multiple R-squared:  0.6263,    Adjusted R-squared:  0.624 
F-statistic: 263.2 on 1 and 157 DF,  p-value: < 2.2e-16

4 Your Turn

4.1 Exercises.

Do a similar investigation by selecting WDI codes.

4.1.1 WDI Code

Choose at least two WDI codes and record their names.

  1. Name: Code:

  2. Name: Code:

4.1.2 Importing Data

Replace dataframe_name, shortname1, and shortname2 below with your own names, and fill in the WDI codes between the quotation marks.

df_dataframe_name <- WDI(indicator = c(shortname1 = "", shortname2 = ""), extra = TRUE)
write_csv(df_dataframe_name, "data/dataframe_name.csv")
df_dataframe_name <- read_csv("data/dataframe_name.csv")

4.1.3 Viewing Data

Inspect the data with head(), str(), and summary(), and also try printing df_dataframe_name itself.

4.1.4 Visualization

Try as many visualizations as possible.

  • rank for each variable

  • line graph

  • scatterplot

  • scatterplot with a regression line

  • histogram

  • boxplot

4.2 References

  1. R for Data Science (2e): https://r4ds.hadley.nz (first edition: https://r4ds.had.co.nz)

  2. Posit Primers: https://posit.cloud/learn/primers

  3. Cheat Sheet: https://posit.cloud/learn/cheat-sheets

---
title: "Introduction to WDI, Part II"
date: "`r Sys.Date()`"
output:
  html_notebook:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_float: yes
    pandoc_args: --number-offset=2
  word_document:
    toc: yes
    reference_docx: intro2wdi_tmp.docx
  pdf_document:
    toc: yes
---

# Exploratory Data Analysis (EDA)

![*The image above is from [R4DS(2e)](https://r4ds.hadley.nz) by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund*](data/data-science.png)

**EDA** is an iterative cycle that helps you understand what your data says. When you do EDA, you:

1.  Generate questions about your data

2.  Search for answers by visualising, transforming, and/or modeling your data

3.  Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: [EDA](https://posit.cloud/learn/primers/3.1))

## GDP and GDP per Capita

1.  GDP, PPP (constant 2017 international \$): NY.GDP.MKTP.PP.KD

2.  Population, total: SP.POP.TOTL

3.  Calculate GDP per Capita

    -   GDP per capita, PPP (constant 2017 international \$): NY.GDP.PCAP.PP.KD

-   GDP, PPP (constant 2017 international \$): PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.MKTP.PP.KD

-   Population, total: Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. ID: SP.POP.TOTL

```{r cache = TRUE, eval = FALSE}
df_gdppcap <- WDI(indicator = c(gdp = "NY.GDP.MKTP.PP.KD", pop = "SP.POP.TOTL", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
```

```{r eval = FALSE}
write_csv(df_gdppcap, "data/gdppcap.csv")
```

```{r}
df_gdppcap <- read_csv("data/gdppcap.csv")
```
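
The three chunks above follow a download-once pattern: call `WDI()` once, save the result as a CSV file, and read the local copy in later sessions. The sketch below shows the same idea in base R. The helper `read_or_fetch()` and the toy function `fake_download()` are hypothetical names used only for illustration; they are not part of the WDI or readr packages, and `fake_download()` merely stands in for a real `WDI()` call.

```r
# Hypothetical helper: call `fetch()` only when `path` does not exist yet,
# then always read the data back from the local file.
read_or_fetch <- function(path, fetch) {
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    write.csv(fetch(), path, row.names = FALSE)
  }
  read.csv(path)
}

# Toy stand-in for a WDI() download; it counts how often it is called.
calls <- 0
fake_download <- function() {
  calls <<- calls + 1
  data.frame(country = c("A", "B"), year = c(2020, 2020), gdp = c(1.5, 2.5))
}

path <- file.path(tempdir(), "demo", "toy.csv")
unlink(path)                                # start from a clean state

df1 <- read_or_fetch(path, fake_download)   # downloads and writes the file
df2 <- read_or_fetch(path, fake_download)   # reads the saved file only
calls                                       # number of downloads performed
```

In the notebook itself the same effect is achieved with `eval = FALSE` on the download and write chunks, so only `read_csv()` runs when the document is rendered.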

```{r}
str(df_gdppcap)
```

```{r}
df_gdppcap |> select(region, income, lending) |> lapply(unique)
```
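
The `unique()` call above lists the values of `region`, which in the rendered output include both `"Aggregates"` and `NA`. Later chunks filter with `region != "Aggregates"`, and it is worth knowing that this condition also drops rows whose `region` is `NA`. A minimal sketch on invented toy rows (the country names are placeholders, not taken from the WDI data):

```r
library(dplyr)

# Toy rows mimicking the three kinds of `region` values
toy <- tibble(
  country = c("Japan", "World", "Unlabelled"),
  region  = c("East Asia & Pacific", "Aggregates", NA)
)

# NA != "Aggregates" evaluates to NA, and filter() keeps only rows that are TRUE
kept <- toy |> filter(region != "Aggregates")
kept$country

# To keep rows with a missing region, the condition has to say so explicitly
kept_na <- toy |> filter(is.na(region) | region != "Aggregates")
kept_na$country
```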

```{r}
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
  ggplot(aes(year, gdppcap)) + geom_line()
```

```{r}
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
  ggplot(aes(year, pop)) + geom_line()
```

#### Exercise.

Write your observations and questions.

### GDP Per Capita

```{r}
df_gdppcap2 <- df_gdppcap |> drop_na(pop) |> 
  mutate(PCAP = gdp/pop, .after = gdppcap)
```

```{r}
df_gdppcap2
```

#### Check against GDP per capita, PPP

```{r eval = FALSE}
df_gdppcap2 |> drop_na(gdppcap, PCAP) |> mutate(near = near(gdppcap, PCAP)) |> 
  summarize(numberofdata = n(), sum(near))
```

```{r}
df_gdppcap2 |> filter(!near(gdppcap, PCAP))
```
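
The check above uses `near()` rather than `==` because two numbers that are computed in different ways rarely agree to the last bit. The following sketch, with made-up numbers, shows the difference and also that `near()` uses an absolute tolerance, which can be adjusted through its `tol` argument.

```r
library(dplyr)

x <- sqrt(2)^2
x == 2          # FALSE because of floating-point rounding
near(x, 2)      # TRUE: the difference is below the default tolerance

# The default tolerance is absolute (about 1.5e-8), so two large values that
# agree to about ten significant digits can still be reported as different.
a <- 12345.678901
b <- 12345.678902
near(a, b)              # difference of about 1e-6 exceeds the default tolerance
near(a, b, tol = 1e-4)  # a looser tolerance treats them as equal
```

When many rows fail the `near()` test, it can therefore help to look at the size of `gdppcap - PCAP` before concluding that the two series really disagree.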

#### Exercise.

Write your observations and questions.

### Visualization

Two useful questions.

1.  What type of **variation** occurs **within** my variables?

2.  What type of **covariation** occurs **between** my variables?

See [Posit Primers: EDA](https://posit.cloud/learn/primers/3.1).

#### Ranks.

`arrange(desc(gdp))` sorts the rows in descending order of `gdp`; `arrange(gdp)` sorts them in ascending order.

```{r}
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> 
  drop_na(gdp) |> arrange(desc(gdp))
```
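
To show only the first few rows of a ranking, `arrange()` can be combined with `slice_head()`, or replaced by `slice_max()` and `slice_min()`. The sketch below uses an invented four-row table; the column names mirror those of `df_gdppcap`, but the values are placeholders.

```r
library(dplyr)

toy <- tibble(
  country = c("A", "B", "C", "D"),
  gdp     = c(40, 10, 30, 20)
)

# arrange(desc()) sorts the whole table; slice_head() keeps the first n rows
top2 <- toy |> arrange(desc(gdp)) |> slice_head(n = 2)
top2

# slice_max() and slice_min() select the n largest or smallest rows directly
top2_alt    <- toy |> slice_max(gdp, n = 2)
bottom2_alt <- toy |> slice_min(gdp, n = 2)
```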

#### Exercises.

1.  Find the 10 countries with the highest GDP per capita.

2.  Find the 10 countries with the lowest GDP per capita.

3.  Find the 10 countries with the largest population.

4.  Find the 10 countries with the smallest population.

#### Scatter Plot

What type of **covariation** occurs **between** my variables?

```{r}
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp)) + geom_point()
```

```{r}
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp)) + geom_point() + 
  scale_x_log10() + scale_y_log10()
```

```{r}
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() + scale_y_log10()
```
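
With both axes on a log10 scale, ggplot2 transforms the data before computing statistics, so `geom_smooth(method = "lm")` fits a straight line to `log10(gdp)` against `log10(pop)`. A straight line on such a plot corresponds to a power law, y = c * x^b, where the slope is the exponent b. The sketch below checks this on invented, noise-free toy data.

```r
# Toy data that follow y = 2 * x^1.5 exactly (no noise)
x <- c(1, 10, 100, 1000)
y <- 2 * x^1.5

# Regressing log10(y) on log10(x) recovers the power law:
# the intercept is log10(2) and the slope is the exponent 1.5
fit <- lm(log10(y) ~ log10(x))
coef(fit)
```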

```{r}
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp, color = region)) + geom_point() + 
  scale_x_log10() + scale_y_log10()
```

```{r}
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> 
  drop_na(gdp, pop) |> 
  ggplot(aes(pop, gdp, color = region, shape = income)) + geom_point() + 
  scale_x_log10() + scale_y_log10()
```

```{r}
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> 
  drop_na(gdp, gdppcap, pop) |> 
  ggplot(aes(gdppcap, gdp, color = region, size = pop)) + geom_point() + 
  scale_x_log10() + scale_y_log10()
```

```{r eval = FALSE}
install.packages("plotly")
```

```{r}
library(plotly)
test <- df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |> 
  ggplot(aes(color = country, shape = region, pop, gdp)) + geom_point() + 
  scale_x_log10() + scale_y_log10() + theme(legend.position = "none")
test |> ggplotly()
```

#### Variation 1. Histogram

```{r}
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |> 
  ggplot(aes(gdp)) + geom_histogram()
```

```{r}
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap)) + geom_histogram()
```

```{r}
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |> 
  ggplot(aes(gdp)) + geom_histogram() + scale_x_log10()
```
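
`geom_histogram()` uses 30 bins unless told otherwise and prints a message suggesting that a better value be chosen. One way to see what a histogram actually counts is to extract the computed bins with `layer_data()`. The sketch below uses ten invented values spanning several orders of magnitude; they are not WDI figures.

```r
library(ggplot2)

# Toy right-skewed values
toy <- data.frame(
  gdp = c(1e8, 3e8, 1e9, 2e9, 5e9, 1e10, 4e10, 1e11, 1e12, 2e13)
)

p <- ggplot(toy, aes(gdp)) +
  geom_histogram(bins = 5) +
  scale_x_log10()

# layer_data() returns one row per bar; on a log10 scale the bin
# boundaries (xmin, xmax) are expressed in log10 units
bins <- layer_data(p)
bins[, c("xmin", "xmax", "count")]
sum(bins$count)   # every observation falls into exactly one bin
```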

#### Exercises.

1.  Change the number of bins, e.g., `geom_histogram(bins = 20)`.

```{r eval = FALSE}
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |> 
  ggplot(aes(gdp)) + geom_histogram(bins = 20) + scale_x_log10()
```

2.  Create a similar histogram of gdppcap by using `scale_x_log10()` and adjust the number of bins.

```{r eval = FALSE}
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap)) + geom_histogram() + scale_x_log10()
```

3.  Create a similar histogram for population.
4.  Write your observations and comments.

#### Extra.

```{r}
df_gdppcap |> filter(year == 2022,region != "Aggregates") |> drop_na(pop) |> 
  group_by(region) |> 
  ggplot(aes(pop, fill = region)) + geom_histogram(col = "black", linewidth = 0.2) + scale_x_log10()
```

#### Variation 2. Boxplot

```{r}
df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10()
```

```{r}
df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |> 
  ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10() +
  labs(title = "Distribution of the GDP per Capita of Countries", subtitle = "Year 1990, 2000, 2010, 2020", 
       y = "Year", x = "GDP per capita in log10 scale")
```
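
A boxplot summarizes each group by its median and quartiles. Because of `scale_x_log10()`, these statistics are computed on the log10 values. The sketch below computes the same quantities by hand for an invented toy table, which can serve as a numerical companion to the plots above.

```r
library(dplyr)

# Toy values; the column names mirror df_gdppcap2 but the numbers are invented
toy <- tibble(
  year    = rep(c(1990, 2020), each = 5),
  gdppcap = c(1e2, 1e3, 1e4, 1e5, 1e6,
              1e3, 1e3, 1e4, 1e5, 1e5)
)

# Quartiles of log10(gdppcap) per year: the box edges and the middle line
box_stats <- toy |>
  group_by(year) |>
  summarize(
    q1     = quantile(log10(gdppcap), 0.25),
    median = median(log10(gdppcap)),
    q3     = quantile(log10(gdppcap), 0.75),
    .groups = "drop"
  )
box_stats
```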

```{r}
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |> 
  filter(income != "Aggregates") |> 
  ggplot(aes(gdppcap, income, fill = income)) + geom_boxplot() + scale_x_log10() +
  theme(legend.position = "none")
```

```{r}
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |> 
  filter(income != "Aggregates") |> 
  ggplot(aes(gdppcap, factor(income, levels = c("High income", "Upper middle income", "Lower middle income", "Low income")), fill = income)) + geom_boxplot() + scale_x_log10() +
  labs(y = "") +
  theme(legend.position = "none")
```
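
The chunk above orders the income groups with `factor(income, levels = ...)`. Two things follow from this, sketched below in base R on a short invented vector: the axis order follows the given levels rather than the alphabet, and any value that is not listed in `levels`, such as `"Not classified"`, becomes `NA`.

```r
income <- c("Low income", "High income", "Not classified", "Upper middle income")
lv <- c("High income", "Upper middle income", "Lower middle income", "Low income")

f <- factor(income, levels = lv)
levels(f)   # the order follows `lv`, not alphabetical order
is.na(f)    # values absent from `lv` are turned into NA
```

If an `NA` category appears on the axis of the plot, adding the missing value to `levels` or filtering it out beforehand are two possible remedies.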

```{r}
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |> 
  filter(income != "Aggregates") |> 
  ggplot(aes(gdp, region, fill = region)) + geom_boxplot() + scale_x_log10() +
  theme(legend.position = "none")
```

## CO2 Emissions Per Capita vs GDP Per Capita

1.  CO2 emissions (metric tons per capita): EN.ATM.CO2E.PC

2.  GDP per capita, PPP (constant 2017 international \$): NY.GDP.PCAP.PP.KD

-   CO2 emissions (metric tons per capita): Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring. ID: EN.ATM.CO2E.PC

-   GDP per capita, PPP (constant 2017 international \$): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD

### Importing Data

```{r cache = TRUE, eval = FALSE}
df_co2gdp <- WDI(indicator = c(co2pcap = "EN.ATM.CO2E.PC", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
```

```{r eval = FALSE}
write_csv(df_co2gdp, "data/co2gdp.csv")
```

```{r}
df_co2gdp <- read_csv("data/co2gdp.csv")
```

### Visualization by Line Graphs

```{r}
COUNTRY <- "World"
df_co2gdp |> filter(country == COUNTRY) |>
  ggplot(aes(year, co2pcap)) + geom_line()
```

```{r}
ISO2C <- c("JP", "CN", "ID", "GB", "US", "DE", "FR")  # the iso2c code for the United Kingdom is "GB", not "UK"
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
  ggplot(aes(year, co2pcap, linetype = iso2c)) + geom_line()
```

#### Exercises.

1.  Change the `iso2c` codes to those of the countries you want to investigate. Use `df_codes` under Environment to look them up.
2.  Change `linetype` to `col`.
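
If `df_codes` is not at hand, the codes can also be looked up in the data itself, since every row carries both `country` and `iso2c`. The sketch below uses an invented toy table with the same two columns; with the real data, `toy` would be replaced by `df_co2gdp`.

```r
library(dplyr)

# Toy stand-in for the country and iso2c columns of df_co2gdp
toy <- tibble(
  country = c("United Kingdom", "United Kingdom", "United States", "Japan"),
  iso2c   = c("GB", "GB", "US", "JP"),
  year    = c(2019, 2020, 2020, 2020)
)

# One row per country, restricted to names containing "United"
codes <- toy |>
  distinct(country, iso2c) |>
  filter(grepl("United", country))
codes
```

A code that matches no row is silently ignored by `filter(iso2c %in% ISO2C)`, so it is worth checking that every requested country appears in the plot legend.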

### Scatterplot for Covariation

```{r}
df_co2gdp |> filter(year == 2020) |> drop_na(co2pcap) |>
  ggplot(aes(gdppcap, co2pcap)) + geom_point()
```

```{r}
df_co2gdp |> filter(year == 2020) |> 
  drop_na(gdppcap, co2pcap) |>
  ggplot(aes(gdppcap, co2pcap)) + geom_point() +
  scale_x_log10() + scale_y_log10()
```

#### Scatterplot with a regression line

```{r}
df_co2gdp |> filter(year == 2020) |> 
  drop_na(gdppcap, co2pcap) |>
  ggplot(aes(gdppcap, co2pcap)) + geom_point() +
  geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
  scale_x_log10() + scale_y_log10()
```

#### Summary of a linear model

```{r}
df_co2gdp |> filter(year == 2020) |> drop_na(gdppcap, co2pcap) |>
  lm(log10(co2pcap)~log10(gdppcap), data = _) |> summary()
```
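
In the rendered summary of this model, the intercept is about -4.315 and the slope about 1.138, with an R-squared of about 0.84. Because both variables are on a log10 scale, the slope can be read as an exponent. The sketch below copies the two coefficients by hand and computes what they imply; `predict_co2()` is a hypothetical helper written for this illustration, and the numbers describe an association in the 2020 cross-section, not a causal effect.

```r
# Coefficients copied from the rendered summary of the log-log model
intercept <- -4.31545
slope     <-  1.13831

# Fitted CO2 emissions per capita (metric tons) at a given GDP per capita
predict_co2 <- function(gdppcap) 10^(intercept + slope * log10(gdppcap))

# Ratio of fitted values when GDP per capita is ten times larger;
# this equals 10^slope, roughly 13.75
ratio_10x <- predict_co2(1e5) / predict_co2(1e4)
ratio_10x
```

A slope slightly above 1 means that, in this fit, emissions per capita rise a little faster than proportionally with GDP per capita.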

## School Enrollment vs GDP Per Capita

1.  School enrollment, secondary (% gross): SE.SEC.ENRR

2.  GDP per capita, PPP (constant 2017 international \$): NY.GDP.PCAP.PP.KD

-   School enrollment, secondary (% gross): Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers. ID: SE.SEC.ENRR

-   GDP per capita, PPP (constant 2017 international \$): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD

### Importing Data

```{r cache = TRUE, eval = FALSE}
df_secgdp <- WDI(indicator = c(sec = "SE.SEC.ENRR", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
```

```{r eval = FALSE}
write_csv(df_secgdp, "data/secgdp.csv")
```

```{r}
df_secgdp <- read_csv("data/secgdp.csv")
```

### Visualization by Line Graphs

```{r}
COUNTRY <- "World"
df_secgdp |> filter(country == COUNTRY) |>
  ggplot(aes(year, sec)) + geom_line()
```

```{r}
COUNTRIES <- c("Low income", "Low & middle income", "Lower middle income", "Middle income", "Upper middle income", "High income")
df_secgdp |> filter(country %in% COUNTRIES) |> drop_na(sec) |>
  ggplot(aes(year, sec, linetype = factor(country, levels = COUNTRIES))) + geom_line() +
  labs(linetype = "Income Levels")
```

#### Exercise.

Change `COUNTRIES` to the `ISO2C` codes of the countries you want to investigate, and filter on `iso2c` instead of `country`. Use `df_codes` under Environment to look up the codes.

```{r}
df_secgdp |> filter(year == 2020) |> drop_na(sec) |>
  ggplot(aes(gdppcap, sec)) + geom_point()
```

```{r}
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
  ggplot(aes(gdppcap, sec)) + geom_point() +
  scale_x_log10()
```

```{r}
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
  ggplot(aes(gdppcap, sec)) + geom_point() +
  geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
  scale_x_log10()
```

```{r}
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
  lm(sec~log10(gdppcap), data = _) |> summary()
```
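
Here only the explanatory variable is on a log10 scale, so the slope is the change in the enrollment ratio, in percentage points, associated with a tenfold increase in GDP per capita. The sketch below copies the coefficients from the rendered summary (intercept about -102.99, slope about 46.09); `predict_sec()` is a hypothetical helper for this illustration only.

```r
# Coefficients copied from the rendered summary of sec ~ log10(gdppcap)
intercept <- -102.994
slope     <-   46.088

# Fitted gross secondary enrollment (%) at a given GDP per capita
predict_sec <- function(gdppcap) intercept + slope * log10(gdppcap)

# Fitted value at a GDP per capita of 10,000 international dollars
fit_10k <- predict_sec(1e4)
fit_10k

# Change in the fitted value when GDP per capita doubles: slope * log10(2)
gain_2x <- predict_sec(2e4) - predict_sec(1e4)
gain_2x
```

Since the gross enrollment ratio can exceed 100% and the residuals in the summary are large, these fitted values are rough guides rather than precise predictions.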

# Your Turn

## Exercises.

Do a similar investigation by selecting WDI codes.

### WDI Code

Choose at least two WDI codes and record their names.

1.  Name: Code:

2.  Name: Code:

### Importing Data

Replace `dataframe_name`, `shortname1`, and `shortname2` below with your own names, and fill in the WDI codes between the quotation marks.

```{r eval = FALSE}
df_dataframe_name <- WDI(indicator = c(shortname1 = "", shortname2 = ""), extra = TRUE)
```

```{r eval = FALSE}
write_csv(df_dataframe_name, "data/dataframe_name.csv")
```

```{r eval = FALSE}
df_dataframe_name <- read_csv("data/dataframe_name.csv")
```

### Viewing Data

Inspect the data with `head()`, `str()`, and `summary()`, and also try printing `df_dataframe_name` itself.

### Visualization

Try as many visualizations as possible.

-   rank for each variable

-   line graph

-   scatterplot

-   scatterplot with a regression line

-   histogram

-   boxplot

## References

1.  R for Data Science (2e): [Link](https://r4ds.hadley.nz). The First Edition: [Link](https://r4ds.had.co.nz).

2.  Posit Primers: [Link](https://posit.cloud/learn/primers).

3.  Cheat Sheet: [Link](https://posit.cloud/learn/cheat-sheets).
